{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Further Hypothesis Testing"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Select this cell and type Ctrl-Enter to execute the code below.\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "data = pd.read_csv(\"stars.csv\")\n",
    "type_key = ['Brown Dwarf', 'Red Dwarf', 'White Dwarf', 'Main Sequence', 'Supergiant','Hypergiant']\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3 - Comparing variances of two groups"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The t-test only compares the *means* of the two groups. \n",
    "Next, Dr Howe would like you to check their *variances*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Question: do types 4 and 5 have the same variance in log(luminosity)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Q-Q plot\n",
    "\n",
    "The [*quantile-quantile plot*](https://en.wikipedia.org/wiki/Q–Q_plot) (Q-Q plot) is a simple, graphical method to check whether two sets of observations appear to come from the same distribution, or to compare one set of data to a theoretical distribution.\n",
    "\n",
    "It is made by plotting the quantiles (i.e. percentiles) of the two distributions against each other.\n",
    "\n",
    "If the variances are the same, the Q-Q plot will approximate a straight line with gradient 1.\n",
    "\n",
    "We can find the percentiles for our sample directly with `pandas`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type4 = data[data.type == 4].luminosity.apply(np.log)\n",
    "type5 = data[data.type == 5].luminosity.apply(np.log)\n",
    "\n",
    "x = np.linspace(0,1,101)\n",
    "t4q = np.array(type4.quantile(x))\n",
    "t5q = np.array(type5.quantile(x))\n",
    "\n",
    "f = plt.figure(figsize=(12, 4))\n",
    "\n",
    "ymin=11\n",
    "ymax=14\n",
    "\n",
    "ax = plt.subplot(121)\n",
    "ax.set_xlabel('quantile')\n",
    "ax.set_ylabel('type 4')\n",
    "ax.set_ylim([ymin,ymax])\n",
    "plt.scatter(x,t4q,color='C4')\n",
    "\n",
    "ax = plt.subplot(122)\n",
    "ax.set_xlabel('quantile')\n",
    "ax.set_ylabel('type 5')\n",
    "ax.set_ylim([ymin,ymax])\n",
    "plt.scatter(x,t5q,color='C5')\n",
    "\n",
    "plt.show()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Plotting type 5 against type 4:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "xlab = 'type 4'\n",
    "ylab = 'type 5'\n",
    "\n",
    "ax = plt.axes()\n",
    "ax.set_xlabel(xlab)\n",
    "ax.set_ylabel(ylab)\n",
    "\n",
    "plt.scatter(t4q, t5q, color='gray')\n",
    "\n",
    "# the line with gradient 1, passing through Q50:\n",
    "m = 1\n",
    "c = t5q[50] - t4q[50] * m\n",
    "plt.plot(t4q, m * t4q + c, color='black')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This looks like a pretty good fit, but we can do an [*F-test*](https://en.wikipedia.org/wiki/F-test_of_equality_of_variances) to be more rigorous:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### F-test for equality of variances"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Theory\n",
    "\n",
    "Once again, we need a two-tailed test:\n",
    "\n",
    "$H_0$: The two populations have identical variance  $\\sigma^2 = \\sigma_1^2 = \\sigma_2^2$.\n",
    "\n",
    "$H_1$: The two populations have non-identical variances, $\\sigma_1^2 \\ne \\sigma_2^2$.\n",
    "\n",
    "The test statistic is simply the ratio of the sample variances:\n",
    "\n",
    "$$F = \\frac{s_1^2}{s_2^2}$$.\n",
    "\n",
    "Under $H_0$, $F$ follows an [*F-distribution*](https://en.wikipedia.org/wiki/F-distribution) with parameters $(n_1 - 1,n_2 - 1)$.\n",
    "\n",
    "We use this distribution to calculate a p-value for the observed value of the test statistic, $F$.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Assumpions\n",
    "\n",
    "- The two samples both follow normal distributions.\n",
    "\n",
    "Note that the means of the two populations may differ.\n",
    "The F-test is highly sensitive to deviations from the assumption of normality.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Application\n",
    "\n",
    "We will set $\\alpha=0.05$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can visualise the F-distribution corresponding to our example ($n_1 = n_2 = 40$):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.linspace(0.1,3)\n",
    "plt.plot(x,stats.f.pdf(x,39,39), color='gray')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We will calculate the test statistic:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fstat = type4.var()/type5.var()\n",
    "print(\"F =\",fstat)\n",
    "\n",
    "plt.plot(x,stats.f.pdf(x,39,39), color='gray')\n",
    "x_region = np.linspace(0.1,fstat,100)\n",
    "plt.fill_between(x_region,stats.f.pdf(x_region,39,39),color='lightgrey')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We use the CDF to calculate the left-tail $p$-value and double it for a two-tailed test:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "p_value = stats.f.cdf(fstat,39,39) * 2\n",
    "print(\"p =\",p_value)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here, $p>\\alpha$ so we accept the null hypothesis of equal variance, at the 5% level.\n",
    "\n",
    "<br>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
